Python - pandas Data Integration (concat, join, merge)
Table of Contents
This article explains methods for integrating multiple datasets into one using the pandas library.
There are several methods for integrating data, but this time we will cover concat, join, merge.
Before we explain, let’s create two example data frames.
>>> import pandas as pd
>>> df1 = pd.DataFrame({
'Class1' : [95, 92, 98, 100],
'Class2' : [91, 93, 97, 99]
})
>>> df2 = pd.DataFrame({
'Class1' : [87, 89],
'Class2' : [85, 90]
})
Output of d1:
Class1 | Class2 | |
---|---|---|
0 | 87 | 85 |
1 | 89 | 90 |
Output of d2:
Class1 | Class2 | |
---|---|---|
0 | 95 | 91 |
1 | 92 | 93 |
2 | 98 | 97 |
3 | 100 | 99 |
1. concat #
The concat function of the pandas library is used to append data frames. This function can append multiple data frames in either row or column direction.
Let’s append df1 and df2 to the result.
>>> result = pd.concat([df1, df2])
>>> result
Class1 | Class2 | |
---|---|---|
0 | 95 | 91.0 |
1 | 92 | 93.0 |
2 | 98 | 97.0 |
3 | 100 | 99.0 |
4 | 87 | 85.0 |
5 | 89 | 90.0 |
6 | 96 | NaN |
7 | 83 | NaN |
pd.concat([df1, df2]) appends df1 and df2 in the row direction. That is, the two data frames are connected vertically.
Next, let’s integrate d3, which only has the Class1 column, into the result.
>>> df3 = pd.DataFrame({
'Class1' : [96, 83]
})
>>> pd.concat([result, df3], ignore_index=True)
Class1 | Class2 | |
---|---|---|
0 | 95 | 91.0 |
1 | 92 | 93.0 |
2 | 98 | 97.0 |
3 | 100 | 99.0 |
4 | 87 | 85.0 |
5 | 89 | 90.0 |
6 | 96 | NaN |
7 | 83 | NaN |
Since the d3 data does not have the ‘Class2’ column, it outputs blank values.
2. join #
The join method of the pandas library is used to combine two data frames based on a specific column. It generally performs a role similar to SQL’s JOIN operation. Unlike concat, join integrates horizontally.
>>> df4 = pd.DataFrame({
'Class3' : [93, 91, 95, 98]
})
>>> df1.join(df4)
Class1 | Class2 | Class3 | |
---|---|---|---|
a | 95 | 91 | 93 |
b | 92 | 93 | 91 |
c | 98 | 97 | 95 |
d | 100 | 99 | 98 |
It is also possible to output by arbitrarily setting the index as follows.
>>> index_label = ['a','b','c','d']
>>> df1a = pd.DataFrame({'Class1': [95, 92, 98, 100],
'Class2': [91, 93, 97, 99]}, index= index_label)
>>> df4a = pd.DataFrame({'Class3': [93, 91, 95, 98]}, index=index_label)
>>> df1a.join(df4a)
Class1 | Class2 | Class3 | |
---|---|---|---|
a | 95 | 91 | 93 |
b | 92 | 93 | 91 |
c | 98 | 97 | 95 |
d | 100 | 99 | 98 |
3. merge #
The merge function of the pandas library is used to merge (integrate) two data frames based on a specific column. Using the merge function, you can combine data frames based on common columns between them.
>>> df_A_B = pd.DataFrame({'Sales Month': ['January', 'February', 'March', 'April'],
'Product A': [100, 150, 200, 130],
'Product B': [90, 110, 140, 170]})
>>> df_C_D = pd.DataFrame({'Sales Month': ['January', 'February', 'March', 'April'],
'Product C': [112, 141, 203, 134],
'Product D': [90, 110, 140, 170]})
df_A_B
Sales Month | Product A | Product B | |
---|---|---|---|
0 | January | 100 | 90 |
1 | February | 150 | 110 |
2 | March | 200 | 140 |
3 | April | 130 | 170 |
df_C_D
Sales Month | Product C | Product D | |
---|---|---|---|
0 | January | 112 | 90 |
1 | February | 141 | 110 |
2 | March | 203 | 140 |
3 | April | 134 | 170 |
Use merge to merge the two data frames based on the ‘Sales Month’ column. As a result, the two data frames are combined based on the ‘Sales Month’ column, and the data is organized around the common columns.
>>> df_A_B.merge(df_C_D)
Sales Month | Product A | Product B | Product C | Product D | |
---|---|---|---|---|---|
0 | January | 100 | 90 | 112 | 90 |
1 | February | 150 | 110 | 141 | 110 |
2 | March | 200 | 140 | 203 | 140 |
3 | April | 130 | 170 | 134 | 170 |
Let’s implement four different ways to combine two data frames using the merge method.
>>> df_left = pd.DataFrame({'key':['A','B','C'], 'left': [1, 2, 3]})
>>> df_right = pd.DataFrame({'key':['A','B','D'], 'right': [4, 5, 6]})
1.
>>> df_left.merge(df_right, how='left', on = 'key')
key | left | right | |
---|---|---|---|
0 | A | 1 | 4.0 |
1 | B | 2 | 5.0 |
2 | C | 3 | NaN |
Left join df_left and df_right based on the ‘key’ column. A left join keeps all rows from the left data frame (df_left) and adds rows from the right data frame (df_right) that have a matching key value. If a matching key value is not present in the right data frame, it is filled with NaN.
2.
>>> df_left.merge(df_right, how='right', on = 'key')
key | left | right | |
---|---|---|---|
0 | A | 1.0 | 4 |
1 | B | 2.0 | 5 |
2 | D | NaN | 6 |
Right join df_left and df_right based on the ‘key’ column. A right join keeps all rows from the right data frame (df_right) and adds rows from the left data frame (df_left) that have a matching key value. If a matching key value is not present in the left data frame, it is filled with NaN.
3.
>>> df_left.merge(df_right, how='outer', on = 'key')
key | left | right | |
---|---|---|---|
0 | A | 1.0 | 4.0 |
1 | B | 2.0 | 5.0 |
2 | D | 3.0 | NaN |
3 | D | NaN | 6.0 |
Outer join df_left and df_right based on the ‘key’ column. An outer join includes all rows from both data frames, filling with NaN where a match is only present in one of the data frames.
4.
>>> df_left.merge(df_right, how='inner', on = 'key')
key | left | right | |
---|---|---|---|
0 | A | 1 | 4 |
1 | B | 2 | 5 |
Inner join df_left and df_right based on the ‘key’ column. An inner join includes only rows that are common to both data frames. That is, it combines rows from both data frames that have the same ‘key’ value.